Solution: Partition and Normalize
Let’s learn how to solve the antipattern by partitioning and normalizing.
We'll cover the following
There are better ways to improve performance if a table gets too large instead of splitting the table manually. These include horizontal partitioning, vertical partitioning, and using dependent tables.
Most database brands can handle many more tables and columns than we would need if we were to use a sensible database design. If we feel that we might exceed the maximum, it’s a clear sign that we need to rethink our design.
Using horizontal partitioning#
We can gain the benefits of splitting a large table without facing any drawbacks by using a feature that is called horizontal partitioning or sharding. We define a logical table with some rules for separating rows into individual partitions, and the database manages the rest. Physically, the table is split, but we can still execute SQL statements against the table as though it were whole.
We have flexibility in that we can define the way each table splits its rows into separate storage. For example, using the partitioning support in MySQL version 5.1, we can specify partitions as an optional part of a CREATE TABLE
statement.
The previous example achieves a partitioning similar to what we saw earlier in this chapter, i.e., separating rows based on the year in the date_reported
column. However, it has advantages over splitting the table manually, in that the rows are never placed in the wrong split table, even if the value of the date_reported
column is updated, and we can run queries against the Bugs
table without the need to reference individual split tables.
The number of separate physical tables used to store rows is fixed at four in this example. When we have rows spanning more than four years, one of the partitions will be used to store more than one year’s worth of data. This will continue as the years go on. We don’t need to add new partitions unless the volume of data becomes so great that we feel the need to split it further.
Partitioning is not defined in the SQL standard, so each database brand implements it in its non-standard way. The terminology, syntax, and specific features of partitioning vary between brands. Nevertheless, some form of partitioning is now supported by every major brand of the database.
Using vertical partitioning#
Whereas horizontal partitioning splits a table by rows, vertical partitioning splits a table by columns. Splitting a table by columns can have advantages when some columns are bulky or seldom needed.
The BLOB
and TEXT
columns have variable sizes, and they may be huge. For efficiency of both storage and retrieval, many database brands automatically store columns with these data types separately from the other columns of a given row. If we run a query without referencing the BLOB
or TEXT
columns of a table, we can access the other columns more efficiently. But if we use the column wildcard *
in your query, the database retrieves all columns from that table, including any BLOB
or TEXT
columns.
For example, in the Products
table of our bugs database, we might store a copy of the installation file for the respective product. This file is typically a self-extracting archive with an extension such as .exe
on Windows or .dmg
on a Mac. The files are usually huge, but a BLOB
column can store binary data of enormous size.
Logically, the installer file should be an attribute of the Products
table. But in most queries against that table, we wouldn’t need the installer. Storing such a large volume of data in the Products
table, which we use infrequently, could lead to inadvertent performance problems if we’re in the habit of retrieving all columns using the *
wildcard.
The remedy is to store the BLOB
column in another table, separate from but dependent on the Products
table. We can make its primary key serve as a foreign key to the Products
table as well to ensure that there is one row per product row.
The previous example may be an extreme case to make the point, but it shows the benefits of storing some columns in a separate table. For example, in MySQL’s MyISAM storage engine, querying a table is most efficient when the rows are of a fixed size. Since VARCHAR
is a variable-length data type, the presence of a single column with the VARCHAR
data type in a table prevents the table from gaining that advantage. If we store all variable-length columns in a separate table, then queries against the primary table can benefit (even if only a little bit).
Let’s insert data in both of the tables, Bugs
and BugDescriptions
, as given below:
Now, as we have inserted the relevant data, let’s try to retrieve this data from the tables in the following playgrounds.
The following code widget shows the data available in the
Bugs
table. We can also retrieve the data from theBugDescriptions
table, and we can also retrieve the results from both tables usingJOIN
. The example code isSELECT * FROM Bugs INNER JOIN BugDescriptions ON Bugs.bug_id=BugDescriptions.bug_id;
.
Fixing metadata tribbles columns#
The solution to fixing metadata tribbles columns is similar to the solution we saw in the chapter Multicolumn Attributes, i.e., the remedy for metadata tribbles columns is to create a dependent table.
Instead of one row per project with multiple columns for each year, we can use multiple rows with one column for fixed bugs. If we define the table this way, we don’t need to add new columns to support subsequent years. Instead, we can store any number of rows per project in this table as time goes on.
Let’s suppose we want to run a query to search many tables at once and that all the tables have the same columns.
We would need the tables with an identical structure, and we would need to have stored them together in a single table with an extra attribute column to distinguish the rows.